Repeatable pipeline for bee genomics research

Dr. Brock Harpur’s lab (Entomology, Purdue) studies the evolution and mechanisms of adaptation in honey bees using genomic tools. To understand how new honey bee genomic data compare with sequence data from existing studies, Dr. Harpur wanted to build a library of honey bee gene variants from the 5,000+ published sequences available in public databases, one that could be reused every time a new genome is sequenced. Producing the final library of gene variants from so many sequences is a multi-step, computationally intensive project that would have been extremely time-consuming to troubleshoot and manage manually. Dr. Harpur contacted Research Services for help automating the steps involved and implementing them in a project management framework so that processing progress could be monitored and any problems easily resolved.
Processing the many honey bee sequences of interest from online genomic data repositories (NCBI/EBI) is challenging for two reasons. First, the various steps in the process, from downloading a sequence to creating an analyzed product (a file of gene variants), must be implemented as a pipeline so that each step hands off to the next automatically. Second, having thousands of small analysis jobs means that the work must be managed and monitored to ensure complete, high-quality end products. We worked with the Harpur lab to combine the individual manual analysis steps into a single pipeline in Nextflow, a framework for managing pipelines and workflows that provides detailed logging and progress tracking. Dr. Harpur said, “It was great working with a flexible and skilled team that could adapt to my schedule while also working independently. The pipeline they’ve developed will be used by my lab and, hopefully, the field for many years to come. We’ve essentially set the standard for the field.”
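
The two needs described above, chaining the steps for each sequence and keeping track of thousands of small jobs, can be illustrated with a minimal Python sketch. This is not the lab's Nextflow pipeline. The step functions, the intermediate alignment step, the file suffixes, the sample names, and the `RunLog` structure are all hypothetical placeholders that only manipulate file names, assumed here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Sequence


# Hypothetical stand-ins for the real steps. Each one only rewrites a
# file name; a real step would download data or launch an analysis tool.
def download(accession: str) -> str:
    if not accession:
        raise ValueError("empty accession")
    return f"{accession}.fastq"


def align(reads: str) -> str:
    return reads.replace(".fastq", ".bam")


def call_variants(alignment: str) -> str:
    return alignment.replace(".bam", ".vcf")


STEPS: Sequence[Callable[[str], str]] = (download, align, call_variants)


@dataclass
class RunLog:
    """Per-sample bookkeeping: final outputs and failure messages."""
    done: Dict[str, str] = field(default_factory=dict)
    failed: Dict[str, str] = field(default_factory=dict)


def run_pipeline(accessions: Iterable[str],
                 steps: Sequence[Callable[[str], str]] = STEPS) -> RunLog:
    """Run every step in order for each accession and record the outcome."""
    log = RunLog()
    for accession in accessions:
        current = accession
        try:
            for step in steps:
                current = step(current)  # output of one step feeds the next
        except Exception as exc:
            # Record which step failed so the sample can be fixed and re-run.
            log.failed[accession] = f"{step.__name__}: {exc}"
        else:
            log.done[accession] = current
    return log
```

Calling `run_pipeline(["SAMPLE_A", "", "SAMPLE_B"])` would leave two entries in `done` and one in `failed`, so a problem sample is visible without stopping the rest. A workflow framework such as Nextflow takes on this kind of bookkeeping, along with scheduling and logging, so it does not have to be written by hand.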