Repeatable pipeline for bee genomics research

In general, a population of individuals carries genetic variation that produces phenotypic variation among them and confers a mix of positive and negative traits across the population. Using genomic techniques to determine which traits an individual may possess and how a species changes over time ultimately requires a way to describe how individuals differ from one another. A standard technique is to compare sequences from individuals to a reference sequence, and then use the resulting patterns of genomic variation to study evolutionary changes, predict traits for individuals and populations, and provide guidance to breeders and farmers.
Dr. Brock Harpur’s lab (Entomology, Purdue) studies evolution and mechanisms of adaptation in honey bees using genomic tools. A key starting point for many of their studies is sequencing samples from honey bees and comparing them to a standard reference honey bee genome to determine the variation among the sampled bees. In addition, many other researchers have published sequences from their own honey bee samples to online genomic data repositories (NCBI/EBI), and being able to describe the variation among these is also valuable. However, analyzing each sequence requires a multi-step, computationally intensive process, so processing even a fraction of the 5,000+ published sequences is a daunting task.
Rather than tackling the genomic analysis manually, Dr. Harpur contacted Research Services for help automating the steps involved and implementing them in a project management framework so that progress could be monitored and any problems easily resolved. The analysis solution had to (1) perform the analysis steps in order, (2) handle many input sequences at once by working in parallel, and (3) monitor progress across the parallel analyses so that any issues could be identified and resolved. Finally, the implementation had to be compatible with computing resources provided by the Rosen Center for Advanced Computing (RCAC) at Purdue.
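To make the three requirements concrete, here is a minimal Python sketch of the general pattern: ordered steps for each sample, several samples processed in parallel, and a per-sample status record. It is an illustration only and is not the pipeline built for the Harpur lab. The step functions (download, align_to_reference, call_variants), the sample names, and the worker count are hypothetical placeholders, since the actual analysis steps are not listed here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


# Hypothetical placeholder steps: each one only tags its input so the
# order of execution is visible. Real steps would run genomics tools.
def download(sample_id):
    return f"{sample_id}.reads"


def align_to_reference(reads):
    return f"{reads}.aligned"


def call_variants(alignment):
    return f"{alignment}.variants"


# Requirement 1: steps are listed in the order they must run.
STEPS = [download, align_to_reference, call_variants]


def run_sample(sample_id):
    """Run every step in order for one sample, feeding each output forward."""
    result = sample_id
    for step in STEPS:
        result = step(result)
    return result


def run_all(sample_ids, max_workers=4):
    """Process samples in parallel and record success or failure for each."""
    status = {}
    # Requirement 2: a worker pool handles several samples at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_sample, s): s for s in sample_ids}
        # Requirement 3: collect a status per sample as each one finishes,
        # so a failure in one sample does not hide the others.
        for future in as_completed(futures):
            sample = futures[future]
            try:
                status[sample] = ("ok", future.result())
            except Exception as exc:
                status[sample] = ("failed", str(exc))
    return status


if __name__ == "__main__":
    for sample, outcome in sorted(run_all(["sample_A", "sample_B"]).items()):
        print(sample, outcome)
```

A workflow manager such as Nextflow provides this kind of ordering, parallelism, and tracking as built-in features, along with logging and integration with cluster schedulers, which a hand-written script like this one would have to reimplement.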
ARGE Research Services worked with the Harpur lab to combine the individual manual analysis steps into a single pipeline in Nextflow, a framework for managing pipelines and workflows that provides detailed logging and progress tracking. The Harpur lab verified, then shared, the code they used to manually conduct each step in the analysis, from downloading the sequence to creating an analyzed product (a file of gene variants). Research Services then implemented the steps in a Nextflow pipeline and developed a framework to handle multiple inputs, allowing multiple sequences to be processed in parallel. The pipeline was implemented and tested on RCAC’s Bell cluster in October 2022. Dr. Harpur said, “It was great working with a flexible and skilled team that could adapt to my schedule while also working independently. The pipeline they’ve developed will be used by my lab and, hopefully, the field for many years to come. We’ve essentially set the standard for the field.”