Statistical tool finds 'gaps' in DNA datasets shouldn't be ignored
A simple statistical test shows that, contrary to current practice, the "gaps" in DNA and protein sequence alignments commonly used in evolutionary biology can provide important information about nucleotide and amino acid substitutions over time. The finding could be particularly relevant to scientists conducting research on distantly related species.
Biologists study evolution by looking at how DNA and protein sequences change over time. These changes can be sequence length changes — when specific nucleotides are deleted or added at certain positions — or substitutions, where one nucleotide type is exchanged for a different type at a given point.
"Think of the DNA sequence and its evolution as a sentence being copied by different people over time," says Jeff Thorne of North Carolina State University, a co-corresponding author of the U.S. National Science Foundation-supported research published in Proceedings of the National Academy of Sciences. "Over time, a letter in a word will change — that's a substitution. Leaving out or adding letters or words correspond to deletions or insertions."
The first step analysts usually perform when looking at evolutionary DNA changes is to construct a sequence alignment. This means figuring out how all the sequences correspond to one another and then aligning those corresponding positions into columns for comparison.
Due to substitutions, insertions and deletions, however, nucleotide types within columns can vary among sequences or be absent altogether. When a sequence does not have a corresponding nucleotide, a gap is placed in the alignment column for that sequence.
"Understanding these nuances will allow scientists to more accurately reconstruct the evolutionary history of organisms," says Leslie Rissler, acting director of NSF's Division of Environmental Biology.
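To make the idea of alignment columns and gaps concrete, here is a minimal Python sketch. The three sequences and their species names are invented for illustration, and the "-" character is assumed to mark a gap, as is common in alignment file formats. The sketch walks the alignment column by column and reports which nucleotides appear and how many sequences have a gap at that position.

```python
from collections import Counter

# Hypothetical toy alignment (not data from the study); "-" marks a gap.
alignment = {
    "species_a": "ACGTAC",
    "species_b": "AC-TAC",
    "species_c": "ATGT-C",
}

def column_summaries(aligned):
    """Yield (index, nucleotide counts, gap count) for each alignment column."""
    rows = list(aligned.values())
    if len({len(r) for r in rows}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    for i, column in enumerate(zip(*rows)):
        # Count only the non-gap characters; gaps are tallied separately.
        observed = Counter(c for c in column if c != "-")
        yield i, observed, column.count("-")

for i, observed, gaps in column_summaries(alignment):
    print(i, dict(observed), gaps)
```

In this toy example, column 1 shows a substitution (two sequences have C, one has T), while columns 2 and 4 each contain a gap in one sequence.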
"Conventionally, when using sequence alignments to do analyses, the gaps within alignment columns are treated as missing data that provide no information about the substitutions," Thorne says. "Historically, the research community has assumed that gap locations are independent of the substitution process. But what if that assumption is incorrect?"
Thorne and colleagues created a simple statistical test to assess whether gap locations are independent of the amino acid replacement process. They looked at 1,390 different sets of sequence alignments and found that in roughly two-thirds of the sets, the usual assumption of independence between gap locations and amino acid replacement was rejected.
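The article does not describe how the researchers' test works, so the following Python sketch is not their method. It is a generic permutation test, shown only to illustrate what testing "independence" between gap locations and amino acid replacements could look like. The toy protein alignment, the function names and the choice of statistic (the number of columns that both contain a gap and vary in amino acid) are all assumptions made for this example. A real analysis would also need to account for the evolutionary tree relating the sequences, which this sketch ignores.

```python
import random

def has_gap(column):
    return "-" in column

def is_variable(column):
    # More than one amino acid type among the non-gap characters.
    return len({c for c in column if c != "-"}) > 1

def permutation_p_value(rows, n_perm=10_000, seed=0):
    """Illustrative two-sided permutation test of gap/variability association."""
    columns = list(zip(*rows))
    gap = [has_gap(c) for c in columns]
    var = [is_variable(c) for c in columns]
    observed = sum(g and v for g, v in zip(gap, var))
    # Expected overlap if gap and variability labels were independent.
    expected = sum(gap) * sum(var) / len(columns)
    rng = random.Random(seed)
    shuffled = var[:]
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between gaps and variability
        stat = sum(g and v for g, v in zip(gap, shuffled))
        if abs(stat - expected) >= abs(observed - expected):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical toy protein alignment, far too small for a meaningful result.
toy_rows = ["MKT-LLVAG", "MRT-LLVAG", "MKSALLV-G"]
overlap, p_value = permutation_p_value(toy_rows)
print(overlap, p_value)
```

A small p-value on a realistic alignment would suggest that gaps and replacements are not independent, which is the kind of rejection the article reports for roughly two-thirds of the alignment sets.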
"One possibility is that gap locations provide useful information about the amino acid replacement process," Thorne says. "If so, evolutionary biologists should develop better techniques for extracting this information."