For genetic researchers, key metadata is often missing

Often when researchers conduct genetic studies with grant funding, they’re required to make the sequence data public. But they aren’t required to report where the sample originated and when it was collected. Researchers say that lack of information results in missed opportunities for scientists to learn more about wildlife trends in a changing global climate.

“There are very few requirements to include metadata associated with these sequences,” said Michelle Gaither, an assistant professor at the University of Central Florida. “If a sequence comes from a sample collected in 1989 versus 2020, you’re going to want to know that, especially if you’re assessing genetic diversity of a population and how it changes over time.”

Even in data repositories that have an option to include that metadata, “it’s human nature to do the least amount of work required of you and move on to the next project,” she said.

Gaither co-authored a study, led by postdoctoral researcher Rachel Toczydlowski and published in Proceedings of the National Academy of Sciences, in which the team set out to determine how much of this metadata was missing from a collection of raw genomic sequence data on wild plants, animals and fungi across the globe, known as the Sequence Read Archive of the International Nucleotide Sequence Database Collaboration.

Gaither and her colleagues turned to graduate students for help. Amid the pandemic, the work gave the students an extra source of income, Gaither said, and it gave her team extra help sorting through the enormous database for missing metadata. They found that 86% of the genomic information lacked GPS location, timestamps and other metadata.

To fill in the missing information, the team tried reaching out to the original scientists. But they had trouble reaching the authors, Gaither said, so they were able to fill in only about a third of the information in the datasets. “It took basically thousands of hours to even attempt it, and there’s still going to be a lot of data missing,” she said.

Gaither would like to see a new set of standards that would require scientists to log that data. That would help other researchers learn more about genetic diversity, she said, which may be particularly important for tracking species survival and the status of populations as the climate changes.

“Genetic data produced currently will become an historical baseline,” she said.

Header Image: Genetic information can help researchers document the diversity of wildlife species. But a lack of location data and time stamps in databases limits scientists’ ability to understand trends in a changing global climate. Credit: Bjørn Christian Tørrissen