About This Site
What is MD Repo?
MD Repo is an open repository for community-generated molecular dynamics (MD) simulations of proteins. It is designed to provide a home for millions of simulations accumulated over years of researcher effort, with an expected eventual scale of tens of petabytes. Storing simulated trajectories is intended to reduce redundant effort, improve reproducibility, and enable new discoveries and modeling techniques.
- Contribute: A researcher may submit any number of simulations to MD Repo, from a single trajectory to thousands. Submitted simulations undergo post-submission validation and preparation, and are then stored in infrastructure backed by CyVerse. Each simulation is stored as a separate entry with standardized metadata. MD Repo places no restrictions on the use or distribution of stored data.
- Explore: Visitors can search and explore simulations submitted by others. An individual trajectory can be downloaded directly from the website. Batches of simulations are downloaded with the MD Repo command-line tool.
- New discoveries: We anticipate that these data will enable new discoveries via re-analysis of individual simulations and through development of new Machine Learning models designed to leverage the rich trove of training data.
Why is MD Repo necessary?
Considering the extensive use of molecular dynamics (MD) in research labs around the world, the large computational burden of individual simulation runs, the high value and reusability of resulting data, and the increasing drive for FAIR and TRUSTed data management, it seems remarkable that there does not yet exist an open repository for protein/drug MD simulations designed to capture and share community-generated data. There are many repositories focused on a particular system or type of molecule, but these are generally not designed to enable community contribution or to handle storage of petabytes of simulation data. MD Repo fills this void, with a design that leverages the scale and stability of the NSF-funded CyVerse ecosystem.
Moreover, modern Machine Learning strategies hold the promise of improving computational prediction of protein structures, interactions, and dynamics. These advances are poised to have a transformative effect across structural biology and drug discovery, but models that predict protein interactions demand a greater volume and diversity of training data. MD Repo has been developed with the goal of accumulating the necessary volume of training data. Over time, MD Repo will also provide the infrastructure necessary to enable researchers to work with data directly in the cloud, removing the need to download the massive dataset to personal compute environments.
Acknowledgements
We thank the University of Arizona Research, Innovation & Impact (RII) for supporting development of MD Repo through BIO5 and IT4IR TRIF Funds. MD Repo would not be possible without the capacity-building infrastructure available through CyVerse, ACCESS, and JetStream2.
How to cite MD Repo
Amitava Roy, Ethan Ward, Illyoung Choi, Michele Cosi, Tony Edgin, et al. “MDRepo - an open environment for data warehousing and knowledge discovery from molecular dynamics simulations,” bioRxiv, Jul. 2024, doi: 10.1101/2024.07.11.602903.