High Performance Computing with Coarse Grained Model of Biological Macromolecules

The Unified Coarse Grained Model of biological macromolecules (UCGM) that is being developed in our laboratory is a model designed to carry out large-scale simulations of biological macromolecules. The simplified chain representation used in the model allows to obtain 3-4 orders of magnitude extention of the time-scale of simulations, compared to that of all-atom simulations. Unlike most of the other coarse-grained force fields, UCGM is a physics-based force field, independent of structural databases and applicable to treat non-standard systems. In this communication, the efficiency and scalability of the new version of UCGM package with Fortran 90, with two parallelization levels: coarse-grained and fine-grained, is reported for systems with various size and oligomeric state. The performance was tested in the canonicaland replica exchange MD mode, with smalland moderate-size proteins and protein complexes (20 to 1,636 amino-acid residues), as well as with large systems such as, e.g., human proteosome 20S with size over 6,200 aminoacid residues, which show the advantage of using coarse-graining. It is demonstrated that, with using massively parallel architectures, and owing to the physics-based nature of UCGM, real-time simulations of the behavior of subcellular systems are feasible.


Introduction
The knowledge of the three-dimensional structure of macrobiomolecules is crucial for understanding many biological processes.Most biologically important molecules are too large to handle for the classical simulation tools.Therefore, coarse-grained models and force fields have become useful in these studies [37].Reduced representations of molecules allow saving computational cost of a few orders of magnitude in comparison with full-atomistic simulations.Various types of coarse-grained models such as the SICHO and CABS models have been developed [13,14].These knowledge-based protein models, developed in the Kolinski group, use only three degrees of freedom per residue.Compared to coarse-grained models of proteins [12], coarse-grained models of polysaccharides have a relatively short history.Most of them use three or four interaction centers per sugar ring [26,28,34], including the polysaccharide component of the well-known MARTINI force field [24].The MARTINI force field allows to simulate also other important biomolecules: lipids, proteins [29], DNA etc.[27] The Unified Coarse Grained Model of biological macromolecules (UCGM) [19] that is being developed in our laboratory is a physics-based model, designed to carry out large-scale simulations of biological macromolecules: proteins (the UNRES model [20]), nucleic acids (the NARES-2P model [7]), polysaccharides (the SUGRES-1P model), and lipid membranes.The chain representation in UCGM is reduced to two interaction sites (united peptide groups and united side chains in UNRES, united phosphate groups and united sugar-base groups in NARES-2P) or one interaction site (united sugar rings in SUGRES-1P).In UCGM the effective energy function was derived via generalized-cumulant expansion [17] of the potential of mean force of biopolymer chains [21,36].Therefore, it reproduces the structure, dynamics, and thermodynamics of biopolymers very well despite the significant reduction of the number of interaction sites.UNRES, which is the most advanced component of UCGM and has the longest history of development, has succeeded in physics-based protein-structure predictions [8,15], investigation of protein-folding [38] and studying biological processes, such as amyloid formation [33], and iron-sulfur cluster biogenesis [30].NARES-2P [7] can reproduce the structures and thermodynamics of small DNA and RNA molecules in free molecular-dynamics simulations and has also recently been applied with success to model the stability of telomere sequences of DNA [35].The preliminary version of SUGRES-1P [19] can reproduce the helical structure of amylose and the sheet-like structure of cellulose.
The article is organized as follows.Section 1 describes the UNRES model, force field, and coarse-grained molecular dynamics used in our study.In Section 2, we present details of implementation of UNRES with Fortran 90.In Section 3, we report the results and their analysis.The Conclusion section summarizes our study.

Theory 1.UNRES Model and Force Field
In the UNRES model (Fig. 1) [19], a polypeptide chain is represented as a sequence of αcarbon (C α ) atoms connected by virtual bonds, with united side chains (SC) attached to them (with different C α • • •SC bond lengths, dsc) and united peptide groups (p) positioned in the middle between the two consecutive C α s.The SCs and ps are the only interacting sites, while the C α s only assist in geometry definition.Backbone geometry of the simplified polypeptide chain is defined by the ).The local geometry of the ith side-chain center is defined by the zenith angle α i (the angle between the bisector of the respective angle θ i and the C α i • • •SC i vector) and the azimuth angle β i (the angle of counter-clockwise rotation of the C α i • • •SC i vector about the bisector from the ).The energy function consists of local (including virtual-bond-deformation, virtual-bond-angle bending, virtual-bond-torsional and double torsional, and virtual-side-chain rotamer potentials), pairwise site-site interaction potentials, and correlation (multibody) potentials [20].The solvent is implicit in the force field.UNRES was derived based on the potential of mean force (PMF) of polypeptide chains in water, which was factorized into Kubo cluster-cumulant functions [17], which are, in turn, identified with specific effective energy terms [21,36].Expansion of the cluster-cumulant functions into Kubo cluster cumulants [17] yielded approximate analytical expressions for the energy terms, including the correlation terms [21,36].The force field was calibrated with solution structural data of 7 traning proteins with various types of structure, by using the maximum-likelihood method recently developed in our laboratory [16].
Because of PMF-based origin of UNRES, the force field has the sense of free energy restricted to a given coarse-grained conformation rather than the potential energy.The temperature dependence was implemented in our earlier work [22]; it affects mostly the correlation terms, thereby contributing to weakening the regular structure elements with increasing temperature.

Molecular Dynamics and Replica Exchange Molecular Dynamics with UNRES
Molecular dynamics is the core of the main conformational-search engine with UNRES.The equations of motion are Langevin equations because of implicit solvent treatment; however, for faster search, the simple Berendsen thermostat [2] (which originates from application of the Langevin equations to a system) is often used [9,10]; the Nosé-Hoover and Nosé-Poincaré thermostats that enable more strict control of temperature conservation were also implemented [11].Because the virtual bonds are treated as stretchable rods, it was natural to select the C α • • •C α and C α • • •SC virtual-bond vectors as variables.With such a representation, the inertia matrix is a full, albeit constant, matrix.Recently, by selecting the C α and SC coordinates as variables, we reduced it to a pentadiagonal form (J. Sans and A. Liwo, to be submitted), this resulting in linear and not quadratic scaling the memory requirements with system size.A version of adaptive multiple-time-step velocity-Verlet algorithm was developed to integrate the equations of motion [9,31].
For improved conformational search, replica exchange (REMD) [6] and multiplexed replica exchange (MREMD) [32] extensions of MD were implemented with UNRES [4].In the MREMD method, a series of independent MD trajectories are run, each at a different temperature, with synchronization every pre-defined (usually 10,000 or 20,000) number of steps.At each synchronization point, the energies of the conformations from consecutive trajectories are compared, and exchange of temperature occurs following the Metropolis criterion.This approach gives a chance for high-energy conformations stuck at low temperatures to escape from high-energy regions because of overcoming energy barriers upon heating and, conversely, to low-energy conformations simulated at high temperatures to cool down and achieve even lower energy.The MREMD method differs from REMD in that more than one trajectory is simulated at a given temperature.

Parallelization
UNRES was parallelized on two levels: coarse-grained and fine-grained [23].At the coarsegrained level, each MD trajectory is handled by a given task or task group.This means trivial parallelization scheme when many canonical MD trajectories are run, or task-farm-type parallelization scheme for REMD/MREMD runs, with the master processor controlling the synchronization and exchange of replicas and the slave processors running only MD trajectories between the synchronization points.The REMD/MREMD UNRES jobs cannot be run in a serial mode, and the number of cores per job must be equal to the number of trajectories if coarse-grained-parallelization only is executed or a multiple of the number of trajectories for fine-grained parallelization.Because the management work is a small fraction of trajectory computations and the speed of calculations of all trajectories is virtually the same, the master processor also computes an MD trajectory.The efficiency of coarse-grained parallelization is close to 100 %.
To treat large systems, fine-grained parallelization was developed [23] in which energy and force evaluation was split between the tasks handling a given trajectory.The task-group master distributes coordinates to the slave tasks and collects the energy and force components from them; it also does its share of energy and force calculations.

Implementation with Fortran 90
Originally, UNRES was written in Fortran 77.Because of this, the code started to be more and more obscure as new functions were added to the package.Therefore, recently [25] we have rewritten the code in Fortran 90, grouping subroutines and data structures into clearly defined modules, which can also be shared by the programs for post-processing UNRES results (WHAM, which executes the weighted-histogram analysis method [18] given the results of REMD/MREMD simulations in order to compute ensemble averages and probabilities of conformations and CLUSTER to group the conformations into families [22]).This approach enabled us to reduce the redundancy in the code to a great extent.Moreover, the code takes advantage of dynamic memory allocation, this making memory usage more efficient, especially for 2-,3-and 4-dimensional arrays.The source code in the UNRES package with Fortran 90 has the following directory structure: • unres (main program modules, source code with all UNRES functionalities); -data (modules containing arrays declarations that correspond to COMMON block files in the UNRES package with Fortran 77); • wham (weighted-histogram analysis method source code); • cluster (cluster analysis source code); • xdrfpdb (format conversion programs source code).Because the postprocessing programs require a negligible fraction of compute time compared to that required for simulations with the main UNRES module, only the performance of UNRES will be discussed in this paper.

Results and Discussion
We tested the scalability of the Fortran 90 implementation of UNRES with proteins of different size and on various hardware platforms, which included our local Beowulf cluster composed of 20-core Dell nodes Intel Xeon CPU E5-2640 v4 2.40 GHz connected with Gigabit Ethernet, at Faculty of Chemistry, University of Gdańsk (piasek4.chem.univ.gda.pl), the 3214-node cluster, each node containing a 12-core Intel Xeon E5 v3 2.3 GHz processor (35,568 cores total), with Infiniband connection, at the Centre of Informatics -Tricity Academic Supercomputer & net-worK (CI TASK) in Gdańsk (tryton-ap.gv.kdm.task.gda.pl), the 1084-node Cray supercomputer, each node containing 2 Intel Xeon E5-2690 v3 12-core processors with Cray Aries connection (okeanos.hpc.icm.edu.pl), and the 1024-node IBM Blue Gene/Q supercomputer, each node containig 16 4-thread PowerPC A2 cores with 5D Torus connectivity (nostromo.hpc.icm.edu.pl), at the Interdisciplinary Centre for Mathematical and Computer Modelling, University of Warsaw.
The speed-ups vs. the number of cores for single-trajectory MD runs are displayed in Fig. 2, while the corresponding efficiency plots are shown in Fig. 3.These data pertain to singletrajectory MD runs, i.e., they illustrate the performance of the software in the fine-grained parallel runs.As expected, the speed up and efficiency increase with the number of residues in a system.For the smallest protein, only up to 20-times speed up can be achieved, while for large proteins over 50 % efficiency can be achieved with up to 24 cores or up to 128 cores with Blue Gene/Q.Except for Blue Gene/Q, which has fast interconnect, all cores used for fine-grained parallelization were contained in a single node.Use of cores from more than one node for fine-grained parallelization resulted in performance deterioration.For example, use of 48 fine-grained processors (2 nodes) on the Cray resulted in the same or lower speed up than that with 24 processors (a single node).
The fraction of communication time split into synchronization (barrier) and data exchange is plotted for the 5HKQ protein in Fig. 4 as a function of the number of fine-grained tasks.It can be seen that communication constitutes less than 10 % of the total wall-clock time and that the main fraction of it is spent on synchronization, which is understandable because the data are exchanged between the cores of the same hardware unit.
To test the efficiency of the parallel implementation of the code in full-blown calculation, replica-exchange runs consisting of 2 to 48 trajectories, single-trajectory runs taken as reference, were carried out for the 1A6S protein (87 residues).Each trajectory consisted of 400,000 steps and exchange of replicas occurred every 10,000 steps.The non-setup time (defined as the wallclock time after subtracting that for data input, initializing variables, and computing the inertia matrix) efficiencies as functions of number of trajectories (coarse-grained tasks) and different numbers of cores per trajectory (fine-grained tasks) are shown in Fig. 5.It can be seen from the figure that the non-setup time and efficiency effectively depend only on the number of fine-  The logarithmic scale with base 2 is used on both axes grained tasks per trajectory.This is because all MD trajectories are running at virtually the same compute speed, this providing good load balancing and data transmission occurs rarely (in the reported calculations only 40 times).
To compare UNRES with the all-atom simulations we have chosen one of the most popular molecular dynamics software packages, GROMACS [1,3].We have used the all-atom GROMACS 5.1.2program run on tryton-ap.gv.kdm.task.gda.pl for 1L2Y, 1A6S, and 5OMT proteins, and 10,000 steps of molecular dynamics with amber03 force field [5].The non-setup times for the above-mentioned proteins, with various numbers of processors are plotted in Fig. 6.As can be seen from this figure, the GROMACS computation time scales almost linearly with both number of processors and protein size.This confirmed that GROMACS is currently one of the most efficient and best load-balanced MD software (due, among other things, to the cut-Supercomputing Frontiers and Innovations off introduction on non-bonded interactions).As shown, UNRES provides a few times quicker simulations compared to GROMACS.Moreover, because the UNRES event-based time scale is at least 1,000 times longer than that of the all-atom approach, the effective speed-up with respect to GROMACS is even bigger.When the single-processor times are compared, the UNRES simulation time is 33 times lower for 1L2Y, 10 times lower for 1A6S, and 5 times lower for 5OMT.Summarizing, the simplified biopolymer-chain representation in UNRES results in several order of magnitude extension of the time-and size-scale of computations, compared to that of the all-atom simulations.

Conclusion
In this work, we implemented the Fortran 90 version of the UNRES package for coarsegrained simulations of proteins and protein complexes on a number of parallel platforms and Figure 6.Plots of non-setup times in 10,000 steps MD simulations with UNRES and GROMACS, with various numbers of processors obtained for the 1L2Y (20-residue), 1A6S (87-residue) and 5OMT (110-residue) proteins with tryton-ap.gv.kdm.task.gda.pl.The logarithmic scale with base 2 is used on both axes tested its parallel efficiency both in the fine-grained mode (energy and force parallelization in single-trajectory molecular dynamics calculation) and in the full-blown (coarse-and fine-grained) mode, in which multiple trajectories were run in replica exchange calculations with communication between the respective task groups every given number of steps, and parallelization of the computation of each trajectory.For proteins with chain length about 300 amino-acid residues or more over 50 % parallel efficiency is obtained.The efficiency does not deteriorate when replicaexchange calculations are carried out.
The parallel implementation of UNRES enables us to simulate the dynamics of quite large proteins in real time.For example, for the 263-residue 5HKQ protein, the time required for 10,000 time steps with 24 fine-grained tasks run on the Cray machine is 24 seconds, which translates to about 176 nanoseconds of molecular-dynamics time per day (24 hours of computations) with the 4.89 fs time step commonly set in UNRES runs.However, because UNRES simulation time is at least 1,000-fold extended with respect to that of all-atom simulations with explicit solvent [9], one day of UNRES simulations for this protein amounts to a 0.
angles γ (γ i has the axis passing trough C α i and C α i+1

Figure 2 .
Figure 2. Plots of the speed-ups for canonical single-trajectory UNRES/MD runs obtained for proteins with various chain length vs. the number of processors on various hardware platforms.The logarithmic scale with base 2 is used on both axes

Figure 3 .
Figure 3. Plots of the efficiencies for canonical single-trajectory UNRES/MD runs obtained for proteins with various numbers of residues vs. the number of processors on various hardware platforms.The logarithmic scale with base 2 is used on x-axis

Figure 4 .
Figure 4. Plots of the fraction of the collective-communication time to the non-setup wallclock time with a given number of cores in a single-trajectory UNRES/MD run for the 5HKQ protein on okeanos.hpc.icm.edu.pl,tryton-ap.gv.kdm.task.gda.pl,piasek4.chem.univ.gda.pl and nostromo.hpc.icm.edu.pl.The right panel (zoom) shows a higher-magnification view of the bottom left region in the left panel.The logarithmic scale with base 2 is used on x-axis

Figure 5 .
Figure5.Plots of non-setup times and efficiencies in 40,0000 step MREMD simulations with replica exchange each 10,000 steps, numbers of trajectories from 1 (reference) to 48, and various numbers of fine-grain tasks per trajectory obtained for the 1A6S (87-residue) protein with trytonap.gv.kdm.task.gda.pl.The logarithmic scale with base 2 is used on x-axis.For a given point in the graph, the total number of cores used is obtained by multiplying the number of trajectories by the number of fine-grained tasks 176 millisecond molecular dynamics.With this speed up, physics-based simulations of biologically important processes are now at hand.More informations about UNRES force field and the source code are available (Description of UNRES program.http://www.unres.pl/unres,accessed: 2018-06-15; Downloads page.http://unres.pl/downloads,accessed: 2018-06-15; UNRES server version.http://unresserver.chem.ug.edu.pl./about, accessed: 2018-06-15).from the National Science Centre of Poland (Narodowe Centrum Nauki).Computational resources were provided by (a) the Interdisciplinary Center of Mathematical and Computer Modeling (ICM) at the University of Warsaw under Grant No. GA71-23, (b) the Centre of Informatics -Tricity Academic Supercomputer & networK (CI TASK) in Gdańsk, and (c) our 796-processor Beowulf cluster at the Faculty of Chemistry, University of Gdańsk.This paper is distributed under the terms of the Creative Commons Attribution-Non Commercial 3.0 License which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.