Artist rendering of genome standards being applied to deciphering the extensive diversity of viruses.
Leah Pantéa for the Joint Genome Institute

Wind ripples across the surface of Wisconsin’s Lake Mendota, splashing waves against sailboats. Minute algae float through the water, fed by runoff from nearby farms and the surrounding city of Madison. But even in these tiny cells, there’s more to see. The algal cells are infected. Giant viruses have taken up real estate, using the algae’s cellular machinery to make more viruses. Zooming in closer reveals that the algae aren’t the only things infected. The giant viruses themselves are infected with tiny viruses called virophages. It’s a turducken of viruses.

Viruses, the most abundant type of biological entity on Earth, live across all ecological niches. They affect how carbon moves through soil and water as well as what species are present in ecosystems. As Katherine McMahon, a microbiologist from the University of Wisconsin-Madison, said, their influence “ricochets up the food web.”

But we humans know very little about them.

Even though scientists supported by the National Science Foundation have studied Lake Mendota for decades, they were unaware of this elaborate part of the food web until recently. Microbiologists had focused on bacteria and other microbes that viruses infect.

Microbial studies in this location and others unintentionally formed a foundation for research into viruses at the Department of Energy’s (DOE) Joint Genome Institute (JGI). This research is identifying gene content — sections of DNA that organisms pass on to their offspring — from viruses in a variety of habitats. By analyzing these genes, scientists can better understand the impacts viruses have on their microbial hosts, their environments, and us.

A Viral Who’s Who

“There are several challenges in viral ecology right now, from the actual identification and classification of unknown viruses to their interactions with their host and environments,” said David Paez-Espino, a researcher at JGI, a DOE Office of Science user facility.

In the past, the traditional approach to identifying viruses was to cultivate them in the lab. But viruses that scientists can grow in the lab represent only a tiny fraction of viral diversity. Most cannot be cultured outside of their habitat. In contrast, researchers wanted to identify viruses that live in natural habitats all over the world.

But viruses don’t make it easy. They evolve quickly and independently from each other. Sequences of DNA often differ radically between groups of viruses, making it hard to sort out viral DNA from non-viral DNA. Even individual viruses in the same group can have DNA sequences so different from each other that it’s difficult to compare them.  

“[Viruses] are almost the last frontier of living things because they’re so diverse,” said Barbara Campbell, a Clemson University microbiologist. While her study of the Chesapeake Bay focused on microbes, her collaboration with JGI made that data available to researchers delving into viral diversity.

Viruses’ unusual characteristics make it hard for researchers to study them using the same techniques they use to study microbes. To study bacteria and other microbes in the wild, scientists like McMahon and Campbell take a sample from a natural environment, such as a lake, then bring it back to the lab. Either at their lab or through a partnership with JGI, they separate and extract DNA from the sample. They then sequence the DNA and assemble the sequences. That assembled DNA forms a genome, a complete set of genes.

To identify the species of microbes, scientists compare the assembled genome content to genomes JGI has in its electronic databases. Even if there are no perfect matches, they can often find hits to genes that have similar sequences. Those similarities allow them to group the microbes into species or broader taxonomic groups.

But until recently, scientists’ databases had little information about viruses. Almost everything viral was unknown.

Discovering a Treasure Trove of Data

That status changed when JGI researchers developed new technology that enabled them to mine their existing database for viruses. JGI houses a wealth of data in its IMG (Integrated Microbial Genomes) database. IMG has more than 60 billion sequenced genes. The world’s largest publicly available resource, IMG has genes from thousands of environmental samples, including the Lake Mendota project.

Back in 2014, JGI scientist Natalia Ivanova and Prokaryote Super Program Head Nikos Kyrpides identified a novel virus with a different genetic code than they had ever seen. In order to identify the bacterial host of that virus, they started searching through the database for sections of DNA that viruses leave behind in bacteria’s genomes after they infect them.

Skimming the data, the researchers happened across a unique sequence of DNA.

“This was incredible luck because it was looking for a needle in a haystack,” said Kyrpides. While these findings were being published, the JGI scientists explored whether by examining and matching up the sections of DNA that viruses left in bacteria, they would be able to identify untold new viral species. Unfortunately, this method kept coming up with genetic sections from non-viral entities. So they put that approach aside.

Next, they looked to the proteins that viruses produce. Viruses can’t produce proteins on their own. Instead, they mooch off the host’s metabolic machinery to produce proteins for them. Information on those viral proteins was buried in the database as well. It was just a matter of getting them out in a way that made sense.

Between the viruses they could cultivate in the lab and a selection of viruses from the wild, JGI scientists created a large catalog of viral proteins. They then used this catalog as bait to fish out new viruses hidden in the JGI’s massive dataset.
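The "bait" idea can be illustrated with a toy sketch. This is not JGI's pipeline, and the article does not describe the actual matching method; the example simply assumes that a stretch of assembled DNA is flagged as likely viral when enough of its predicted proteins resemble entries in a catalog of known viral proteins. The function names, the short made-up protein strings, and the thresholds are all hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def matches_catalog(protein, catalog, k=3, min_shared=0.5):
    """True if the protein shares at least `min_shared` of its k-mers
    with any single catalog entry (a crude stand-in for a real
    protein similarity search)."""
    query = kmers(protein, k)
    if not query:
        return False
    return any(
        len(query & kmers(ref, k)) / len(query) >= min_shared
        for ref in catalog
    )


def looks_viral(contig_proteins, catalog, min_hits=2):
    """Flag a contig as likely viral when at least `min_hits` of its
    proteins match the viral protein catalog (threshold is assumed)."""
    hits = sum(matches_catalog(p, catalog) for p in contig_proteins)
    return hits >= min_hits


# Made-up catalog of "known viral proteins" and two made-up contigs.
catalog = ["MKTAYIAKQR", "GSHMLEDPVA"]
contig_a = ["MKTAYIAKQW", "GSHMLEDPVT", "AAAAAAAAAA"]  # two near-matches
contig_b = ["PPPPPPPPPP", "MKTAYIAKQR"]                # only one match

flag_a = looks_viral(contig_a, catalog)
flag_b = looks_viral(contig_b, catalog)
```

Requiring several hits rather than one reflects the problem the researchers ran into earlier: a single match can easily come from a non-viral entity.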

This new approach to virus discovery allowed the researchers to completely re-evaluate their massive database. They re-analyzed more than 3,000 samples from all over the world, including those from Lake Mendota.

“It was this treasure trove of data that we could explore and dig into,” said Emiley Eloe-Fadrosh, a JGI researcher. “We are very much indebted to all of the collaborators the JGI has worked with.”

By the end of the study, Paez-Espino, Ivanova, Kyrpides, Eloe-Fadrosh, and other JGI researchers identified 125,000 viral sequences and could match 10,000 of them to their bacterial hosts.

“We know that our world is a microbial world. We know these viruses are shaping the microbial systems,” said Paez-Espino. “It was a great way to mine all of the data we had.”

So Much More than Scientists Imagined

Assembling the viral genomes — trying to complete each virus gene set — was just the beginning. It was like making a huge list of all of your relatives, but not knowing how any of them were related to each other. By looking at the similarities between genomes, the scientists were able to sort the viruses into about 80,000 groups.
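Sorting genomes into groups by similarity can be sketched in a few lines. The example below is only an illustration under stated assumptions: it measures similarity as the overlap between sets of short DNA "words" (k-mers) and places two genomes in the same group whenever that overlap passes a threshold. The article does not say which similarity measure or clustering method the researchers used, so the names, sequences, and cutoff here are hypothetical.

```python
from itertools import combinations


def kmers(seq, k=4):
    """Return the set of all length-k substrings of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: shared items divided by total distinct items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_genomes(genomes, k=4, threshold=0.5):
    """Group genomes (dict of name -> sequence) so that any two genomes
    whose k-mer similarity reaches `threshold` end up in the same group."""
    parent = {name: name for name in genomes}

    def find(x):
        # Follow parent links to the group's representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {name: kmers(seq, k) for name, seq in genomes.items()}
    for a, b in combinations(genomes, 2):
        if jaccard(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in genomes:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Made-up sequences: v1 and v2 differ by one base, v3 is unrelated.
genomes = {
    "v1": "ATGCGTACGTTAGC",
    "v2": "ATGCGTACGTTAGG",
    "v3": "TTTTCCCCGGGGAAAA",
}
groups = cluster_genomes(genomes)
```

With these toy inputs, v1 and v2 land in one group and v3 stays alone. Real viral genomes are far longer and far more divergent, which is part of why grouping them is hard.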

Previously, the public databases held only about 4,000 DNA viral genomes. This study found 16 times as many.

“What we found was just many thousands of new viruses that had never been characterized before,” said Eloe-Fadrosh. “We found a much greater diversity than even the microbial populations that we had been studying for decades.” Follow-on studies have expanded that number to 750,000 viral sequences.

These diverse viruses infect an equally diverse array of bacterial and other microbial hosts. Matching up viruses with the hosts they infect is the first step in understanding where viruses fit into the food web and the role they play in an ecosystem. This is where the sections of DNA that viruses left in bacteria ended up coming in handy. By examining these leftover sections in bacteria and matching them up with the viral genomes, the researchers were able to match up the viruses and their hosts. So far, the researchers have identified nearly 40,000 new host-virus relationships. Some viruses even infected a wide range of microbes.
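The matching step described above can be sketched as a simple search. This is a minimal illustration, not the researchers' method: it assumes each host comes with a list of short leftover viral fragments, and it links a host to a virus when one of those fragments appears exactly in the viral genome on either DNA strand. Real analyses typically tolerate mismatches; the function names and sequences below are hypothetical.

```python
def reverse_complement(dna):
    """Return the opposite strand, read in the usual direction.
    Assumes uppercase A, C, G, T only."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(dna))


def match_hosts(host_fragments, viral_genomes):
    """Link viruses to hosts.

    host_fragments: dict of host name -> list of leftover DNA fragments
    viral_genomes:  dict of virus name -> genome sequence
    Returns a sorted list of (virus, host) pairs with an exact match
    on either strand.
    """
    pairs = set()
    for host, fragments in host_fragments.items():
        for fragment in fragments:
            rc = reverse_complement(fragment)
            for virus, genome in viral_genomes.items():
                if fragment in genome or rc in genome:
                    pairs.add((virus, host))
    return sorted(pairs)


# Made-up data: host1 carries a piece of phiA, host2 carries a piece
# of phiB stored on the opposite strand, host3 matches nothing.
viral_genomes = {
    "phiA": "ATGGCGTACCTGAAGT",
    "phiB": "CCCTTTGGGAAACCCT",
}
host_fragments = {
    "host1": ["GTACCTGA"],
    "host2": ["TTTCCCAAA"],
    "host3": ["ACACACAC"],
}
links = match_hosts(host_fragments, viral_genomes)
```

A virus that turns up in the results alongside several different hosts would correspond to the broad host range mentioned above.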

The other surprise was the geographic spread of the viruses. While the same viruses showed up in similar ecological niches, these niches were often in far-flung parts of the globe that the viruses shouldn’t be able to travel between. Viruses from the same species-like group showed up in lakes on opposite sides of the world.

“The connectivity of different viruses across different environments was really something we didn’t expect,” said Eloe-Fadrosh. “This opened the door to thinking about how the biogeography of viruses is much broader than we expected.”

From the Giant to the Tiny

Now the data are even expanding our definition of what a virus is.

Viruses don’t have the ability to multiply on their own. They rely on their hosts to do that. They also don’t have the ability to encode transfer RNAs, which other organisms use to build the proteins that researchers looked for. That’s always been one way scientists have distinguished viruses from traditional forms of life.

Giant viruses break this distinction. The “giant” refers to the fact that they have much larger genomes with more DNA than most viruses and many bacteria do. In addition, their genomes can encode transfer RNAs, which other viruses can’t. Using the same database, researchers found a giant virus with the largest genetic code and much larger translation machinery detected in a virus so far. They also identified six never-before-seen species-like groups of giant viruses.

Virophages are on the opposite end of the spectrum. While viruses infect microbes, virophages infect giant viruses. From the Lake Mendota dataset as well as one from another lake in Wisconsin, JGI scientists identified 25 new sequences of virophages. That’s double the number of these types of virus that scientists had previously known.

The relationship between giant viruses and the viruses that infect them may play a major role in ecosystems like Lake Mendota. Just like a virus can make a human sick, virophages slow down giant viruses’ ability to reproduce. That slow-down may keep the giant viruses from effectively infecting other cells, like algae. In Lake Mendota, that lack of a brake on algal growth may influence the likelihood of toxic blooms. Algae compete against cyanobacteria, which produce these blooms. When cyanobacteria grow in excess, they can reduce the amount of oxygen in the lake and kill off fish. The identity and role of virophages could affect everything from water quality to Wisconsin tourism.

Diving into these viral genomes is just the start. As McMahon said about the work on Lake Mendota, “The closer we look, the more we realize that we don’t know or we don’t understand. I was impressed and surprised by the amount of different kinds of viruses that came out of our data.”


The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information please visit /science.

Shannon Brescher Shea
Shannon Brescher Shea is the social media manager and senior writer/editor in the Office of Science’s Office of Communication and Public Affairs.