Bioinformatics recipes: creating, executing and distributing reproducible data analysis workflows

Abstract

Background

Bioinformaticians collaborating with life scientists need software that allows them to involve their collaborators in the process of data analysis.

Results

We have developed a web application that allows researchers to publish and execute data analysis scripts. Within the platform, bioinformaticians can deploy data analysis workflows (recipes) that their collaborators execute via point-and-click interfaces. The results generated by a recipe are viewable via the web interface and consist of a snapshot of all the commands, printed messages, and files generated during the recipe run.

A demonstration version of our software is available at https://www.bioinformatics.recipes/.

Detailed documentation for the software is available at: https://bioinformatics-recipes.readthedocs.io.

The source code for the software is distributed through GitHub at https://github.com/ialbert/biostar-central.

Conclusions

Our software platform supports collaborative interactions between bioinformaticians and life scientists. The software is presented via a web application that provides a high-utility, user-friendly approach for conducting reproducible research. The recipes developed and shared through the web application are generic, have broad applicability, and may be downloaded and executed on other computing platforms.

Background

The majority of bioinformatics analyses consist of several customizable computational tasks chained together to form a so-called pipeline or workflow. Publishing, documenting, and sharing these computational analyses are the cornerstones of reproducible research [1,2,3,4].

In this paper, we present a web application that allows bioinformaticians to publish and execute data analysis workflows. We call these workflows “bioinformatics recipes.” A “recipe” can be thought of as a standalone data analysis script that runs in a computing environment. A recipe may be a collection of several command-line tools; it may be a Makefile, a Snakemake [5] file, a Nextflow [4] pipeline, an R script, or any other command-line-oriented program.

We designed our framework such that any series of commands may be formatted and published as a recipe. In addition, the application can generate a graphical user interface for each recipe, thus facilitating user interaction and parameter selection at runtime.

Implementation

Our software is a Python- and Django-based application that can be installed and run with minimal system administration knowledge and is intended to be deployed to serve individual research groups.

Our software also offers project-based laboratory data management. Within the management interface, all content is grouped into projects that may have public or private visibility. Content stored in public projects is readable without restrictions; private projects restrict access to members only. Within each project, content is divided into three main categories:

  1. Data (the input files)

  2. Recipes (the code that processes the data)

  3. Results (the directory that contains the resulting files of applying the recipe to data)

Figure 1 shows a project view with Data, Recipes and Results displayed in separate tabs. A typical workflow combines one or more Data items with a Recipe to produce a Result: Data + Recipe -> Results.

Fig. 1

The recipe listing within a project. Each recipe has a short description as well as a link to the results produced by the recipe

First, the data section must be populated. Data may be uploaded or may be linked directly from a hard drive or a mounted filesystem, thus avoiding copying and transferring large datasets over the web. For recipes that connect to the internet to download data, for example from the Short Read Archive, the data does not need to be present on the local server beforehand.

Notably, the concept of “data” in our system is broader and more generic than that of a typical file system. In our software, “data” may be a single file, a compressed archive containing several files, or a path to a directory that contains any number of files and subdirectories. The programming interfaces for recipes handle directories transparently, making it possible to run a recipe written for a single file on every file of an entire directory, as sketched below.
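To illustrate the idea, the sketch below shows how a recipe body could process either a single file or a whole directory with the same code. The parameter name data, the {{...}} placeholder notation, and the use of FastQC are assumptions chosen for illustration rather than the platform's exact conventions.

```bash
# Illustrative sketch: the value substituted for the "data" parameter
# may be a single FASTQ file or a directory of FASTQ files.
INPUT={{data.value}}

# Build the list of files to process: the single input file,
# or every FASTQ file found inside the input directory.
if [ -d "$INPUT" ]; then
    FILES="$INPUT"/*.fastq.gz
else
    FILES="$INPUT"
fi

# Apply the same quality-control command to each file.
mkdir -p results
for FILE in $FILES; do
    fastqc "$FILE" --outdir results
done
```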

Each recipe may be assigned a graphical user interface specification written in TOML format. From this TOML code the website generates a user interface that is connected to the underlying data analysis script; Fig. 2 shows an interface generated from such a specification.
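A minimal sketch of what such a specification may look like is shown below. The parameter names (reads, cutoff) and the field names (label, display, type, value) are illustrative assumptions rather than the exact snippet behind Fig. 2; the precise schema is described in the documentation at https://bioinformatics-recipes.readthedocs.io.

```toml
# Illustrative interface specification (parameter and field names assumed).
# Each TOML table describes one parameter of the generated web form.

[reads]
label = "Sequencing reads"    # text shown next to the widget
display = "DROPDOWN"          # render as a dropdown of data in the project
type = "FASTQ"                # restrict the dropdown to FASTQ data
value = ""                    # filled in when the user makes a selection

[cutoff]
label = "Quality cutoff"
display = "INTEGER"           # render as a numeric input box
value = 30                    # default value shown in the interface
```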

When a recipe is executed, the parameters selected in the graphical user interface replace the corresponding parameters inside the recipe code. The interface “specification language” provides the building blocks for creating user interfaces.

Fig. 2

The graphical user interface for the recipe generated from the TOML specification snippet. This interface allows different parameters to be set and passed into the code of the recipe

The code for each recipe may be inspected before execution, as seen in Fig. 3. Notably, the recipe code consists of executable instructions that may also be run on other platforms; a sketch of such a script follows.
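As a concrete, hedged example, a recipe whose interface defines the reads and cutoff parameters sketched earlier might be a short shell script along these lines. The {{...}} placeholders, the parameter names, and the choice of Trimmomatic and SeqKit are assumptions for illustration, not the recipe shown in Fig. 3.

```bash
# Illustrative recipe sketch. {{reads.value}} and {{cutoff.value}} stand for
# values chosen in the web interface; outside the website the same script
# runs once the placeholders are replaced by concrete paths and numbers.

set -ue                       # stop on errors and on unset variables

READS={{reads.value}}
CUTOFF={{cutoff.value}}

mkdir -p results

# Quality-trim the reads (Trimmomatic is used here purely as an example tool).
trimmomatic SE "$READS" results/trimmed.fastq.gz SLIDINGWINDOW:4:"$CUTOFF"

# Summarize the trimmed reads; the report ends up among the result files.
seqkit stats results/trimmed.fastq.gz > results/stats.txt
```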

Fig. 3

The recipe code consists of the computer code that is executed when the recipe is run. This code may be a series of shell commands, R code, Python code, or any scripting-oriented instruction set

Running a recipe on a data entry produces a “result” directory. A result directory contains all the files and all the metadata created by the recipe as it is executed on the input data. Each run of a recipe generates a new result directory. Users may inspect, investigate, and download any of the files generated during the recipe run. Additionally, users may copy a result file to serve as new input data for another recipe.

Upon executing a recipe on a dataset, a result directory is generated that lists all files created during the recipe run (see Fig. 4). In addition, all messages printed on the standard output or standard error streams are captured as files and may be inspected later, as sketched below.
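As a sketch of the underlying idea (the file names here are assumptions, not the platform's actual layout), capturing those streams amounts to redirecting them into files that are stored alongside the other results:

```bash
# Illustrative only: run the recipe script and keep its output streams
# as files so they can be inspected later from the web interface.
mkdir -p results
bash recipe.sh > results/stdout.txt 2> results/stderr.txt
```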

Fig. 4

The result interface shows all the files generated by a recipe run. In addition, the directory contains all the information necessary to reproduce the analysis: the code, the metadata, and a log of the standard output and error generated during the execution of the recipe

The web application that we have developed also provides laboratory data management services. Recipes, data, and results can be copied across projects; users may create new projects and may allow others (or the public) to access a project's contents. As constructed, the web application provides a transparent and consistent framework for conducting analyses that can be shared among collaborators or with the public, and that may be reproduced over time thanks to the preservation of the runtime-specific version of the code.

Discussion

The need to simplify access to command line tools via graphical user interfaces has long been recognized by several research groups. To address this need, various frameworks with similar goals [3,4,5] have been proposed, developed, and deployed. For example, Shiny [6] is an R package that provides a framework for turning R code into an interactive webpage. wEMBOSS [7] was proposed as a web-based environment from which a user can make use of EMBOSS tools in a user-friendly way. Our approach differs substantially from each of these prior works and aims to serve the needs of different audiences. Conceptually, the most similar software is Galaxy [8], a web application that deploys command-line-oriented bioinformatics software tools via a web-based graphical user interface.

The recipe approach is similar to Galaxy in that it serves non-technical audiences. Additionally, just like Galaxy, recipes provide a user-friendly graphical interface to facilitate their use. The main difference relative to Galaxy is that every recipe is downloadable and executable as a standalone program. Thus, recipes may be run on any computational platform and do not depend on the web interface. We refer readers to the detailed documentation of the software at https://bioinformatics-recipes.readthedocs.io/, where we discuss in detail the differences between our approach and that of existing software platforms.

Notably, in our recipe approach, the roles are more separated and distinct than in Galaxy. In our typical use cases, bioinformaticians develop and test the analysis code at the command line, then turn their code into recipes and share them with collaborators. Once a recipe is shared via the website, collaborators can select parameters and execute it on data of their choice. Collaborators may also inspect, copy, and modify the recipe code.

Conclusion

Our software was developed to provide bioinformatics support for metabarcoding analyses at the US Fish and Wildlife Northeast Fisheries Center and has been in operation for over a year. We found that the software is well suited for environments where bioinformaticians interact and collaborate with scientists from diverse backgrounds, and where consistent types of analyses need to be applied to varying datasets. In addition, we have found that the recipe approach integrates well into bioinformatics education. We have used the bioinformatics recipes website while delivering graduate-level classes over several semesters at Penn State and found the approach to be well received by students. Using recipes allowed us to demonstrate bioinformatics software in a manner that closely resembles its original command-line usage. As instructors, we were able to demonstrate complete workflows to students, showing both the code and all the results that the code produced, while allowing students to copy, share, and customize computational pipelines.

The current deployment contains a tutorial, education-related materials, and numerous recipes that demonstrate typical analytical workflows, from quality control to RNA-Seq data analysis. We envision individual research groups and organizations running private instances of our code to serve their local needs and audiences. Hosting the software on a web platform allows both the bioinformatics analysis tools and the pipeline code to be updated as new versions become available.

In conclusion, we have developed software that supports bioinformaticians assisting and collaborating with life scientists. The software is presented via a web application that provides a high-utility, user-friendly tool for conducting reproducible research.

Availability of data and materials

A public deployment of the Bioinformatics Recipes software can be accessed at https://www.bioinformatics.recipes/.

The website contains recipes developed for US Fish and Wildlife as well as recipes used in online courses teaching bioinformatics. The code is released with an open source license and may be accessed at: https://github.com/ialbert/biostar-central.

References

  1. Strozzi F, et al. Scalable workflows and reproducible data analysis for genomics. Methods Mol Biol. 2019;1910:723–45.


  2. Leipzig J. A review of bioinformatic pipeline frameworks. Brief Bioinform. 2017;18(3):530–6. https://doi.org/10.1093/bib/bbw020.


  3. Federico A, et al. Pipeliner: a Nextflow-based framework for the definition of sequencing data processing pipelines. Front Genet. 2019;10:614. https://doi.org/10.3389/fgene.2019.00614.

  4. Di Tommaso P, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316–9.


  5. Köster J, Rahmann S. Snakemake--a scalable bioinformatics workflow engine. Bioinformatics. 2012;28(19):2520–2. https://doi.org/10.1093/bioinformatics/bts480.


  6. Chang W, et al. Shiny: web application framework for R. 2019. https://CRAN.R-project.org/package=shiny.


  7. Sarachu M, et al. wEMBOSS: a web interface for EMBOSS. Bioinformatics. 2005;21(4):540–1.


  8. Giardine B, et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005;15(10):1451–5.



Acknowledgements

Not applicable.

Availability and requirements

Project name: Biostar Central

Project home page: https://github.com/ialbert/biostar-central

Documentation: https://bioinformatics-recipes.readthedocs.io

Operating system(s): Platform independent

Programming language: Python 3.6 or above

Other requirements: Django

License: GNU GPL 3.0

Restrictions to use by non-academics: none.

Funding

This work has been supported by the US Fish and Wildlife Service Cooperative Agreement Award F16AC01007 as well as the Pennsylvania State University. The funds supported the cost of developing and deploying the software application and the cost of developing data analysis recipes for the Northeast Fisheries Center of the US Fish and Wildlife Service.

The findings and conclusions in this article are those of the author(s) and do not necessarily represent the views of the U.S. Fish and Wildlife Service.

Author information


Contributions

AM, CR, MB and IA conceived the project. NA and IA developed the software and wrote the paper. AS, AM and IA developed recipes for metabarcoding analysis. All authors contributed to the functional design and usability design of the software. IA supervised the research. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Istvan Albert.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Not applicable.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Aberra, N., Sebastian, A., Maloy, A.P. et al. Bioinformatics recipes: creating, executing and distributing reproducible data analysis workflows. BMC Bioinformatics 21, 292 (2020). https://doi.org/10.1186/s12859-020-03602-6

