MOUSE2: Molecular Ordering Utilities for Simulations, Edition 2

c


Introduction
In recent decades, particle-based computer simulations have become a vital tool to advance scientific knowledge and get valuable insights into the behavior of complex systems [18].Simulations can be used for high-throughput screening, to check the influence of conditions that cannot be easily achieved experimentally, and are also important for validation of the models.
As most of the considered systems are chaotic [64], meaningful results cannot be obtained only from the visual analysis of the snapshots, but require considering a statistically representative ensemble of configurations [17].Therefore, the data from particle simulations has to be quantified, i.e. reduced to a relatively small set of comprehensible parameters.
Molecular ordering and self-assembly are the focus of computer simulations due to their great importance to many areas of research, including physics, chemistry, biology and medicine, and because computer simulations are the perfect tool to explore the relationships between molecular-level phenomena and practically important behavior of the systems.
Analysis of the results of particle simulations is implemented in multiple software tools, including those with open-source code.For example, one can use the algorithms included in popular simulation engines such as LAMMPS [57] or GROMACS [1] to calculate the required quantities in the course of the simulation or process the trajectories afterwards.In the case of post-processing, the range of possible tools is wider and includes multiple toolkits and libraries.Among the latter are MDAnalysis [45], VMD [37], Travis [10], PyLAT [36], Pteros [62] and MDTraj [44].However, in many cases, the data analysis is performed by the unpublished inhouse code, which hinders the reproducibility of the results.
In this manuscript we share a set of software tools which were used to describe the behavior of various complex polymer systems [2,[19][20][21][22][23][24][25][26][27], mostly representing the class of so-called conformationally asymmetric polymers.The parameters calculated by our tools exploit the ability of computer simulations to directly access the structure of the systems.This way, one can quantify the patterns that can be observed only in computer simulations and explore the relationship between the parameters of the system, its molecular structuring and its experimentally observed properties.This can reveal underlying processes responsible for complex phenomena observed in polymer systems.
The order parameters which are calculated by the proposed tools are independent of the atomistic details of the system and capture its general behavior.To analyze the results of detailed (e.g.atomistic) simulations, one can either select specific particles to represent the spatial positions of monomer units or apply a systematic coarse-graining procedure, i.e. consider the centers of mass of the units as their positions.The latter option can be implemented through available open-source coarse-graining software, for example, the VOTCA toolkit [54], or done within the MDAnalysis framework.
The order parameters are also designed to work without prior knowledge about the ordering in the system.For example, the estimation of helical parameters based on the autocorrelation function does not depend on the orientation of the helical fragments.The local ordering parameter and lamellar ordering parameter also do not depend on the direction of ordering in the system.However, some parameters, such as the backbone correlation distance or the local ordering cutoff range, are left to the user's discretion.It is worth noting that these spatial parameters can be efficiently employed to assess the scale of the spatial correlation in the systems and can be an important tool in assessing the soundness of the results in terms of size-related artifacts [17].
The utilities are written in Python3 [58], which is an easy-to-learn and flexible interpreted programming language.To make our code convenient for a wide range of researchers, we have refactored it and moved to use the popular and actively developed MDAnalysis library [45] to process the data input, providing an interface to input the data from many simulation packages, including LAMMPS, GROMACS, AMBER, HOOMD, Tinker, as well as data in the standard PDB format.The format of the data is recognized automatically by MDAnalysis based on filename extensions.The MDAnalysis also provides intrinsic support for processing of the time series.For the analysis of aggregation, we use the NetworkX library [32].The input parameters are processed by the Argparse library.The output is provided in standard Python dictionary structures with a short description and results for all of the timesteps.It can be printed out in JSON [52] data format, to be assessed directly or processed further.In some of the utilities, Matplotlib [38] is used to plot the results.
Computing local structural properties is challenging because they scale as O(N 2 ), where the number of particles N can be as large as 10 5 -10 6 [4].While some of the calculation-intensive algorithms could be reasonably implemented only using NumPy [34], we tried to optimize the code and use NumPy-based vectorisation wherever possible.For example, to calculate the list of neighbors for a selected atom, the distances between the atom and all of the other atoms are calculated, with the values exceeding the cutoff masked using the numpy.mamodule.The indices of the atoms that do not satisfy the neighbor criteria are then filtered out using the numpy.ma.compress function.Similar approach, implying vectorized computations with masking out the irrelevant values was used to calculate the bond autocorrelation function as well as the superhelical twist parameters.
In the following sections, we describe the individual algorithms and their applicability to structural analysis, and in the end, we describe the performance optimization techniques employed.

Algorithms 1.Backbone Bonds Autocorrelation and Helicity
Helix is one of the main ways of polymer packing which is widespread in synthetic and biological macromolecules [15,39,48].The helical packing combines the connectivity of the monomer units in the macromolecule with a high density [48].The helical symmetry can be determined by optimization of the contacts between the polymer chains [11,39], by steric limitations on the dihedral angles between the neighboring units along the backbone [20], and can emerge spontaneously [13,16].Many pure synthetic polymers, such as polylactic acid, crystallize in a helical form, which determines their physical and chemical properties.In biological macromolecules, the helical secondary structure serves as a basis for self-organization and determines the structuring at the higher levels [29,30,35,43,60].During the protein folding, the formation of the local structure, including the helical motifs, provides the entrance to the so-called "folding funnel", immensely reducing the entropy of the macromolecule [3,6,7].It was shown that local helical structure of a macromolecule can significantly affect the probability of knot formation [63].
The analysis of the secondary helical structures in the soft matter is complicated due to the local nature of the correlations.Therefore, the algorithm must be agnostic in terms of the orientation of the helical fragments, and the length of the analyzed correlations could be limited.
In [21] we have proposed the autocorrelation function of the backbone bonds to quantify the local helical ordering in the macromolecules: where d i designates the normalized vector, connecting the atoms of the i-th backbone bond, and N is the length of the backbone.From the analysis of the resulting autocorrelation functions, it is possible to determine the actual number of monomer units per helix turn and to evaluate the stability of the structure quantitatively. ( To evaluate the spatial parameters of the helical structure and its stability, one can fit C(k) with a function in the form to obtain the number of monomer units per turn and the persistence length 1/β.It is worth noting that the amplitude of the oscillations and the shift correspond to the squared cosine and sine of the helix angle respectively: The decay parameter β can be used as a measure of the stability of the helical structure.Alternatively, one can use the value of C(p), which corresponds to correlations between the backbone at the neighboring turns of the helix.The observed values can also be compared to the perfect crystalline form of the polymer, as was done in [28] for polylactic acid.
To use the implementation, the user can call the bond autocorrelations.pytool, provide the input data file, the maximum value of k, and specify whether the correlations between the bonds connecting the beads with different residue IDs shall be taken into account.With the --plot option the graph is shown, and with the --fit option, the curve is fitted with the SciPy [59] curve fit function, and the values of p, B and β from (3) are calculated.

Local Bond Orientations
Due to entanglement, the polymers can crystallize only partially, forming separate crystalline domains with different orientations.The degree of crystallinity together with the size of the domains and their mutual orientation can greatly affect the thermal, mechanical and barrier properties of the polymers.
In many biological systems, a complex local arrangement of the polymer chains is observed, when the backbones align at a certain non-zero angle to each other, forming a helix-coil structure or a helical multiplet.Such an arrangement can be caused by steric reasons [55], solvophobic [12] or electrostatic [40] interactions.Through these mechanisms, which make the basis of the socalled "chirality transfer", the local helicity can cause the twisting of the polymer filaments [50,61], which, in its turn, can limit the growth of the bundles composed of such filaments [30,61].
Therefore, studying the ordering effects at the molecular level can provide valuable insights.To establish the validity of computer simulations [5,17] it is important to compare the characteristic scale of the ordering with the size of the simulation cell.
To evaluate the local ordering, we have implemented the algorithm to calculate the angles χ between the bonds, when the distance between the midpoints of the bonds lies in the interval between the user-defined minimum r min and maximum r max distances.
To avoid the complexity, we suggest using the bonds in the molecular dataset as the vectors.In the case of polymers, these can be the vectors connecting the positions of 1-st or higherorder neighbors along the polymer chain.To analyze the macromolecules which have a helical structure, one can determine the vectors as the ones connecting the i-th and i + p-th monomer units, where p is the number of monomer units per helix turn, and p can be determined from bond autocorrelation analysis as described above.Alternatively, the ordering of helical macromolecules can be studied using the vectors, connecting the centers of mass of each p monomer units along the polymer backbone.
The values can then be used to calculate the 2-nd Legendre polynomial: which can serve as the order parameter.To calculate the spatial scale of ordering, one can check the dependence of the value of the parameter on the cutoff radius.The minimum reasonable value of the cutoff radius is determined by the density of the system and can be of the order of the first coordination sphere.To avoid the finite-size effects, the maximum value must not exceed half the size of the simulation cell.The values of the angles between the neighboring bonds and helical segments were used in [23] to investigate the mechanically induced ordering of polylactic acid, and in [2] to study the structuring of polymer globules.
To use the implementation, the user can call the local alignment.pytool, and provide the input data files, the minimum (--r min) and maximum (--r max) distance between the midpoints of the vectors which will be taken into account.With the --histogram option, the distributions of the parameters in terms of cos 2 χ, cosχ and χ, will be calculated.The distributions for cos 2 χ, normalized by the total area and by the solid angle (so that the distribution is uniform for the isotropic case) can be plotted by using the --plot option.Studying the distributions of χ can be insightful for many biological and biomimetic systems, where the neighboring helical segments prefer to align at a certain angle to each other [55].

Orientation of Copolymer Blocks in Lamellae
The procedure described in this section is applied to the systems of diblock copolymers with a stiff block which undergoes microphase separation into the lamellar phase.Microphase separation is an important phenomenon which occurs in melts and concentrated solutions of block copolymers with incompatible blocks and results in the formation of micro-and nanodomains of various shapes [8,33,42].This phenomenon finds its practical application in a wide range of nanostructured materials where the domains are endowed with predefined properties: those can be electronic or ionic conductivity, optical properties, etc. [41,53].
Block copolymers containing a stiff block and a flexible block demonstrate the properties of both flexible coil-coil diblock copolymers and liquid crystals.The self-assembly of such block copolymers results in novel unique morphologies distinct from the classic morphologies formed by coil-coil diblock copolymers [14,47] and can include various forms of lamellae.In particular, it was found that for block copolymers with stiff blocks the latter can be smectically or nematically ordered [46].However, the orientation of the blocks is meaningful in relation to the lamellar plane.To calculate the orientational order parameters of the stiff blocks relative to the normal vector of the lamellae plane, we have implemented the algorithm described in [49] and used it to study the ordering of helix-coil block-copolymers [19].
First, it should be taken into account that the lamellae can take random orientation in the simulation cell, so the procedure of determining the vector normal to the lamellar planes is applied.For each polymer chain, the vectors connecting the centers of mass of incompatible blocks are calculated, normalized to have a unit length and placed to the origin of coordinates.The resulting unit vectors are denoted as e i = (e ix , e iy , e iz )(|e i | = 1).We find the gyration tensor of those vectors as follows: In this equation, n ch is the number of polymer chains in the system, N stif f is the number of units in the stiff block.The eigenvalues and corresponding eigenvectors are calculated for this tensor.The lamellar normal vector k is denoted as the eigenvector which corresponds to the largest eigenvalue.
(a) Schematic representation of several molecules of the helix-coil diblock copolymer within the lamella, obtained in [19] (b) Coordinate system related to lamellar domains as proposed in [49] Figure 2. Schematic representation of the lamellar domain and the corresponding coordinate system.Blue arrows are the vectors connecting the centers of mass of the coil and helix block (in Fig. 2a) and the derived lamellar normal (in Fig. 2b).Green arrows are the vectors connecting the ends of the helix block (in Fig. 2a) and the director of these blocks (in Fig. 2b) The orientational order parameter L k is introduced as the average orientation of the stiff blocks with respect to the lamellar normal: Here, a is the unit vector showing the direction of each stiff block, i.e. the vector between the stiff block ends, and the averaging is performed over all the polymer chains.The modulus of L k tends to zero if the orientations of the stiff blocks are not correlated and is close to unity when the stiff blocks orient perpendicularly to the lamellae surface.
We also introduce the director n of all the stiff blocks.It is defined via the same procedure as k, using the vectors connecting the ends of the stiff blocks a i normalized to the unity length (|a i | = 1) instead of e i in (7).The plane comprising the vectors n and k is denoted as a tilt plane.We calculate the tilt angle θ of the stiff blocks using three order parameters: In (10) and (11), γ is the angle between the lamellar normal k and a i ; ϕ is the angle between the vector c, which belongs to the tilt plane and is perpendicular to the lamellar normal (c ⊥ k), and the projection of the vector a i on the plane formed by the vector c and the normal to the tilt plane h (Fig. 2 and ref. [49]).The parameter P k describes the biaxial distribution of the stiff blocks within the lamellae; V is a coefficient in the non-diagonal term of the tensor order parameter of the system [49] and describes the tilt of its main axis with respect to the lamellar normal.The algorithm is implemented in lamellar alignment.pytool.The user should provide an input data file and, optionally, the names of the block types.
The output of the calculation includes the order parameter L k which is the measure of the correlation between the directions of the stiff blocks with each other and with the lamellar plane and the tilt angle θ of the stiff blocks.

Superhelical Twist
The formation of superhelical structures is common in the self-organization of biological macromolecules.The local helical structure of polymer chains can lead to a preferential alignment angle between the neighboring polymer chains, as discussed in section 1.2.In helical multiplets, this causes the twisting of the helical polymer chains around each other.The spatial dimensions of this type of structure depend on the spatial parameters of individual helices [31].To assess the handedness of these superhelices, one can connect uniformly points of the helical tube with the vectors and calculate the dihedral angles formed by these vectors in the range (−π; π].If the length of the vectors is commensurate to the period of the superhelix, the dihedral angles will have preferentially negative values, corresponding to the left-hand superhelices, or positive values, corresponding to right-hand superhelices [2].The algorithm quantitatively supported the visual observations reported in [2], and the results were presented in the Supporting Information of [2]. For a set of the integer values of k, provided by the user, the algorithm calculates the dihedral angles between the triplets of consecutive vectors connecting the monomer units of the backbone: (r i , r i+k ), (r i+k , r i+2k ), (r i+2k , r i+3k ), where r j is the position of the j-th monomer unit.By default, the dihedral angle values are calculated using those values of i, where the start points and end points of all three vectors belong to the same molecule.
To calculate the values of the described dihedral angles, the user can call the backbone twist.pyutility, and provide the list of the values of k, and the criteria for backbone atom selection in terms of MDAnalysis select atoms function.With the --plot option, the distributions for all of the values of k will be plotted using the NumPy histogram function and PyPlot.

Aggregation
The algorithm calculates the list of the aggregates at each timestep.The atom is defined as belonging to a particular aggregate if it has at least one neighboring atom belonging to the same aggregate.
At a given timestep, a list of neighbors for each atom is created.The atoms are considered neighbors if they lie within the cutoff distance from each other.The user can switch between the distance matrix calculation function of the MDAnalysis library or our algorithm based on NumPy (see the Performance Optimization section).The algorithm then loops through the atoms and for each atom loops through its neighbors, adding the pairs of the atoms and their neighbors as the edges of the NetworkX graph object [32].
The iterative deepening depth-first search algorithm is then applied to the graph [56].The atom indices returned by this search are then stored as an individual aggregate, and removed from the list of all atom indices.The procedure is repeated until the list of all atom indices is empty.
The function can be called using aggregates.pytool with the data file and the cutoff distance as input.The output contains lists of the aggregates in a form of the lists of atoms in each aggregate.Such lists are generated for each timestep.

Performance Optimization
The performance of the toolkit is achieved by employing NumPy vectorization for computationally intensive operations.For example, to calculate the list of neighbors for a selected atom, the distances between the atom and all of the other atoms are calculated.The resulting squared distances are then compared to the maximum distance between the neighbors, and the mask for the numpy.mamodule is generated.The indices of the atoms that do not satisfy the neighbor criteria are then filtered out using the numpy.ma.compress function.
To vectorize the calculation of the bond autocorrelation functions, two arrays of bond vector coordinates are calculated, which are shifted by the autocorrelation distance k: b1 = np .pad ( b , ( ( 0 , k ) , ( 0 , 0 ) ) , c o n s t a n t v a l u e s = 1 .) b2 = np .pad ( b , ( ( k , 0 ) , ( 0 , 0 ) ) , c o n s t a n t v a l u e s = 1 .) The residue identifiers, associated with the bonds, are then also padded, to check whether the result is meaningful, and the autocorrelation function is calculated for all of the bonds: / np .l i n a l g .norm ( b1 , a x i s = 1 ) / np .l i n a l g .norm ( b2 , a x i s = 1 ) ) To calculate the dihedral angles associated with the superhelical twist, a similar scheme is performed first to calculate the vectors forming the dihedral angles, and then the values of the dihedral angles themselves.

Usage Examples
The examples directory of the project repository currently contains two Python scripts, which demonstrate using some of the functions.The systems for analysis are taken from the RSCB Protein Data Bank (PDB) [9].We use the bond autocorrelations.pytool to detect the parameters of the helices in a de novo protein with the PDB identifier 3H5G [51].The script invokes the bond autocorrelations.pycommand through the os.system() call, corresponding to the command line usage scenario.The bond autocorrelations.pyfinds the average number of residues per helix turn as p = 3.45, which is close to the values for the α-helix, and the positive shift of the autocorrelation function, which depends on the elongation of the helix, as b = 0.54.
The very low value of the decay parameter β = −0.01 is indicative of a well-defined helical structure, which is consistent with the authors' conclusions about its stability.For the second protein from the PDB, with the identifier 6OCK [65], the example script invokes the local alignment.pyfunction, corresponding to the scenario where the user can build her/his own tools upon the functions provided by our package.The script calculates the average order parameter s, corresponding to the alignment of the helices, depending on the distance between their midpoints.The helices are taken according to the information provided in the original data bank record, and only those helices that are equal or longer than 10 residues are taken into account.
The distributions of the angles between the helices are then calculated for the main maximum at r = 8 and a small maximum at r = 14.The plots show that the helices located closely and belonging to the same domain are mostly aligned with each other, while at larger distances there is a wide distribution of mutual orientations of the helices, which can result in fluctuations of the average local alignment parameter s, leading to local extrema.

Availability and System Requirements
GitHub page of the project.The unit tests for the algorithms are provided with the package and are located in the tests directory of the GitHub repository.We have developed, tested and used the software with the Linux platforms used by our group, and have not tested the installation and functioning of the software on other platforms.In case of the Windows operating system, the installation shall be possible using the PyPI and/or Conda package managers, and we will be happy to cooperate with the users to make the software work on other platforms.

Conclusions
Our package offers several tools for the analysis of molecular dynamics simulations and is available for download from GitHub under the GNU General Public License, and from PyPI.With the help of MDAnalysis library, it can process the output of multiple simulation software.The algorithms were optimized for performance and with the modern hardware can process the systems containing up to 10 6 particles.
Although it is impossible to implement all of the usage scenarios, the tools can be extended and the functions can be called from the Python code.We are open to cooperation and will consider implementing other usage scenarios.
We are planning to build on these utilities and add more functionality in the course of our future research.

Figure 1 .
Figure 1.The bond vectors in a helix with an even number of monomer units per turn i f not d i f f e r e n t m o l e c u l e s : # Pad t h e r e s i d u e i d a r r a y s r e s i d 1 = np .pad ( b o n d r e s i d s , ( 0 , k ) , c o n s t a n t v a l u e s = 0 ) r e s i d 2 = np .pad ( b o n d r e s i d s , ( k , 0 ) , c o n s t a n t v a l u e s = 0 ) # Take i n t o a c c o u n t o n l y m o l e c u l e s w i t h same r e s i d u e i d v a l i d = np .l o g i c a l a n d ( v a l i d , np .e q u a l ( r e s i d 1 , r e s i d 2 ) ) # Mask i s True f o r t h e v a l u e s t h a t a r e not v a l i d mask = np .l o g i c a l n o t ( v a l i d ) # C a l c u l a t e t h e c o r r e l a t i o n v a l u e s f o r a l l t h e bonds c = ( np .sum ( np .m u l t i p l y ( b1 , b2 ) , a x i s = 1 ) (a) Autocorrelation function C(k) of the helical fragments in 3H5G [51] molecules (b) Average order parameter s of the helical fragments alignment in 6OCK [65] depending on the distance r between the midpoints of the helices (c) Distributions of the angles between the helical fragments in 6OCK [65] at r = 8 and r = 14

Figure 3 .
Figure 3.Results for processing example proteins with helical structures from the RSCB PDB[9]