Researchers interested in improving a given trait in plants can now identify the genes that regulate the trait’s expression without doing any experiments.
Purdue University’s Kranthi Varala and 10 co-authors published the details of the new web-based regulatory gene discovery tool in the April 23 issue of Proceedings of the National Academy of Sciences. Varala has a patent pending on results related to economically important seed oil biosynthesis.
The Purdue-USDA team sought to build a resource that learns, from large amounts of publicly available data, to quickly identify which transcription factors, a special class of genes, regulate the expression of a given trait in various plant species.
“Every study focuses on a handful of them,” said Varala, assistant professor of horticulture and landscape architecture. “Our premise was that if we can put all of it into a single analysis, then we can use this data to build something global.”
Arabidopsis served as the PNAS study’s model plant, “but this approach has nothing specific to Arabidopsis,” Varala said. “The approach is general enough that you could start with a corn dataset. You could do it with rice, with tomato, whatever crop you’re working on as long as you have thousands of gene expression measurements that people have done. And there are over a dozen species now where we have tens of thousands of gene-expression studies.”
To prove the system works, the team focused on a genetic pathway that regulates how plants make and store oil in their seeds. The team picked that trait because of its importance in food and biofuel production, and because more than 300 of the genes involved are already known.
By genetically manipulating a plant’s transcription factors, researchers can increase or decrease the amount of oil produced in its seeds.
Like other researchers, Varala has pursued many projects over the years where his goal was to identify the genes and regulators involved in solving one problem. This meant conducting careful, time-consuming experiments. But the data generated fell short of providing all the answers he sought. He compared it to working an equation knowing only three of the 10 factors involved.
“You can’t solve the equation,” he said. Likewise, Varala often wanted to ask more questions than the data could answer. That motivated him to build a framework that draws on all available data to ask those questions. Instead of first running every relevant experiment, researchers start from a list of candidates that then need genetic validation.
“I’m trying to short-circuit the initial data collection phase,” Varala said, so that scientists can focus on conducting the genetic validations. But to do so, his team had to begin with a dataset based on 18,000 individual studies.
Varala and his team analyzed this massive dataset using the Bell and the now-retired Brown supercomputers at Purdue’s Rosen Center for Advanced Computing. The team built a machine-learning framework to speed the process for others.
It would be impossible for one person to do this manually. A team could do it, but that would introduce biases in how group members process the data. The machine-learning classifier operates without bias.
The novelty of the approach is that instead of pulling data related to all organs, it focuses on organ-specific datasets. Independent gene networks regulate these organs — leaves, roots, shoots, flowers and seeds.
“Instead of using all organs, we said, within the seed experiments that people have done over the years, can we use all the data to learn something that’s happening in the seed and not necessarily the root or the leaf or the flower? That improved our approach a lot.”
The team used a computational method called the inference approach to predict what transcription factors were going to regulate the seed oil biosynthesis process in Arabidopsis.
“The ones we know help us validate that our approach is working correctly. The ones that we don’t know are good candidates for finding out new biology,” Varala said. “This purely computational approach knows nothing about seeds or oil or anything like that. We gave it a list of genes and it was able to rediscover the known ones without knowing any biological context.”
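The idea of ranking regulators from organ-specific expression data can be illustrated with a small sketch. This is not the team's method: it swaps in plain Pearson correlation as a stand-in for the inference step, and the sample layout, the gene names (TF_A, TF_B, OIL1, OIL2) and the function rank_regulators are all hypothetical, chosen only for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_regulators(samples, tfs, targets, organ):
    """Rank candidate transcription factors for one organ.

    samples: list of {"organ": str, "expr": {gene: value}} (assumed layout).
    Score = mean absolute correlation between a factor and the target genes,
    computed only on samples from the requested organ.
    """
    subset = [s["expr"] for s in samples if s["organ"] == organ]
    scores = {}
    for tf in tfs:
        tf_vals = [e[tf] for e in subset]
        cors = [abs(pearson(tf_vals, [e[g] for e in subset])) for g in targets]
        scores[tf] = sum(cors) / len(cors)
    return sorted(scores, key=scores.get, reverse=True)

# Toy, made-up measurements: four seed samples and one root sample.
samples = [
    {"organ": "seed", "expr": {"TF_A": 1.0, "TF_B": 5.0, "OIL1": 2.0, "OIL2": 1.1}},
    {"organ": "seed", "expr": {"TF_A": 2.0, "TF_B": 1.0, "OIL1": 4.1, "OIL2": 2.0}},
    {"organ": "seed", "expr": {"TF_A": 3.0, "TF_B": 4.0, "OIL1": 6.0, "OIL2": 3.2}},
    {"organ": "seed", "expr": {"TF_A": 4.0, "TF_B": 2.0, "OIL1": 7.9, "OIL2": 3.9}},
    {"organ": "root", "expr": {"TF_A": 9.0, "TF_B": 9.0, "OIL1": 0.1, "OIL2": 0.2}},
]

# Only the seed samples are used; the root sample is filtered out.
ranking = rank_regulators(samples, ["TF_A", "TF_B"], ["OIL1", "OIL2"], organ="seed")
```

In this toy data TF_A tracks both oil genes closely across the seed samples, so it ranks ahead of TF_B. As in the study, the code is given only a list of genes and knows nothing about what they do.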
The lead author, Rajeev Ranjan, a postdoctoral researcher in the department of horticulture and landscape architecture at Purdue, took the 12 of the top 20 predicted regulators that were not already known and tested whether those predictions hold. “We were able to generate mutant lines for 11 of those 12. Five of those 11 do change the seed oil content,” he said. “Further, we also showed that overexpression of one factor increases seed oil up to 12%.”
The eight known regulatory genes, added to the five new ones, showed that the inference approach accurately identified 13 of the top 20 candidates. The strength of the approach is that, working only from a list of genes, it can predict with high accuracy which ones will regulate a trait of interest.
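The counts reported for the top 20 candidates (12 not previously known, 11 mutant lines generated, five with changed seed oil content) can be tallied in a few lines. This is only a bookkeeping sketch using the figures quoted in the article; the variable names are illustrative.

```python
top_candidates = 20      # size of the ranked list that was examined
untested = 12            # candidates not previously known as regulators
already_known = top_candidates - untested   # regulators known beforehand
mutant_lines = 11        # untested candidates with mutant lines generated
changed_oil = 5          # mutant lines that changed seed oil content

confirmed = already_known + changed_oil     # known plus newly validated
fraction_confirmed = confirmed / top_candidates
```

The one candidate without a mutant line remains untested rather than disproven, so the confirmed fraction is a lower bound on how many of the top 20 are true regulators.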
“It took a long time to do because it’s a long, complicated process, and there was no guarantee that it would work,” said Varala of the four-year project. “Nothing on this scale had been attempted before.”
Varala has disclosed the innovation to the Purdue Innovates Office of Technology Commercialization, which has applied for a patent to protect his intellectual property.
This research was supported by the U.S. Department of Energy Office of Science.